The purpose of this module is to provide the researcher and the
consumers of such research with a recognition of the assumptions and
appropriate use and interpretation of each of 10 multiple comparison
(MC) statistical techniques. In this self-contained and
self-instructional module, the user is sensitized to the serious
consequences of inappropriate multiple comparison use by applying all
MC methods to the same data. He then is introduced to the criteria for
selecting the best MC method for a given purpose. Computational
considerations follow with self-instructional exercises and mastery
tests. A familiarity with the t-test and one-factor analysis of
variance is required. (Author/SE)
NCERD Reporting Form: Developmental Products

1. Name of Product

Instructional Module on Multiple Comparison Techniques in Research.

2. Laboratory or Center

(LER)

3. Report Preparation

Date prepared: 11/9/73
Reviewed by K.D^HowWllS. 



riirprtnr 



4. Problem: Description of the educational problem this product is designed to solve.

Many research studies use inappropriate and inefficient statistical procedures
for comparing three or more groups. Applied textbooks are not adequate in providing
the needed competencies for selection of the most powerful multiple comparison
(MC) method that will answer the researcher's questions.




5. Strategy: The general strategy selected for the solution of the problem above.

The training strategy is: (1) to survey current experimental statistics
textbooks to illustrate the characteristically uneven and inappropriate coverage;
(2) to contrast the grossly different conclusions that will result depending on the
MC method used; (3) to provide a flowchart guide to select the MC method of choice;
(4) to provide self-instructional exercises for developing needed user competencies.



6. Release Date: Approximate date product was (or will be) ready
for release to next agency.

12/1/73

7. Level of Development: Characteristic level (or projected level)
of development of product at time of release. Check one.

   Ready for critical review and for preparation for Field Test
   (i.e. prototype materials)
 X Ready for Field Test
   Ready for publisher modification
   Ready for general dissemination/diffusion



8. Next Agency: Agency to which product was (or will be)
released for further development/diffusion.



NIE 



10.71A (D) 
9. Product Description: Describe the following; number each description.

1. Characteristics of the product.
2. How it works.
3. What it is intended to do.
4. Associated products, if any.
5. Special conditions, time, training, equipment and/or other requirements
   for its use.



Characteristics of the Product : 

A 34 page discussion of various MC procedures: their assumptions, consequences,
and interpretations. The module is self-contained and self-instructional. (See
also "5. Strategy")

How it Works :

The user is sensitized to the serious consequences of inappropriate MC use by
applying all MC methods to the same data. He then is introduced to the criteria
for selecting the best MC method for a given purpose. Computational considerations
follow with self-instructional exercises and mastery tests.

What it is Intended to Do :

Provide the research producer and consumer with a recognition of the
assumptions and appropriate use and interpretation of each of ten MC techniques.

Requirements for Use :

A familiarity with the t-test and one-factor analysis of variance.
10. Product Users: Those individuals or groups expected to use the product.

The product is intended to be used by applied researchers in education and by 
students in intermediate courses in statistics or experimental design. 



11. Product Outcomes: The changes in user behavior, attitudes, efficiency, etc. resulting
from product use, as supported by data. Please cite relevant support documents. If
claims for the product are not yet supported by empirical evidence, please so indicate.

Of twenty-eight users responding to anonymous evaluation forms, 46% rated the
materials as "very good"; 39% rated them as "good"; and only 14% rated them as
"fair" or "poor." The median error rate was 7.5%.

To the question, "Are the materials superfluous, i.e., are there other sources that
accomplish the same purposes that are as good or better?", 85% of the responses
were "No."

The rating of "good" or "very good" by 85% of the users suggests instructional
value.



12. Potential Educational Consequences: Discuss not only the theoretical (i.e. conceivable)
implications of your product but also the more probable implications of your product,
especially over the next decade.

The use of this product is expected to result in more appropriate use and
interpretation of research studies involving three or more groups.
13. Product Elements: List the elements which constitute the product.

One self-contained and self-instructional module.

14. Origin: Circle the most appropriate letter.

D = Developed
A = Adopted

15. Startup Costs: Total expected costs to procure,
install and initiate use of the product.

Reproduction cost only.

16. Operating Costs: Projected costs for continuing
use of product after initial adoption and
installation (i.e., fees, consumable supplies,
special staff, training, etc.).

Reproduction.

17. Likely Market: What is the likely market for this product? Consider the size and type of
the user group; number of possible substitute (competitor) products on the market; and
the likely availability of funds to purchase product by (for) the product user group.

Research and evaluation personnel, especially those being trained on the job.
Students in intermediate statistics and experimental design courses.
INSTRUCTIONAL MODULE ON MULTIPLE COMPARISON TECHNIQUES IN RESEARCH
I. A Guide for Selecting the "Method of Choice"^a



This module has two major components: the first deals with the particular
advantages and disadvantages of each MC method; the second presents the
computational interrelationships of the various procedures.

The need for a researcher's guide to the use of multiple comparison (MC) 
techniques is illustrated by recent studies by Tringo (1970) and Wilson (1971). 
Although these are not poor studies, they illustrate the two extremes in
their selection and use of an MC technique. Tringo (1970) used multiple
t-tests to make comparisons among seven groups; the multiple t-test procedure
has an inordinately high risk of falsely rejecting a true null hypothesis.
Wilson (1971) employed the Scheffe test to detect significant differences
among three means; this method is the most conservative and least powerful
of all MC methods for contrasting pairs of means.

When there are more than two treatment or comparison groups being studied,
the analysis of variance (ANOVA) or covariance (ANCOVA) will determine whether
or not the differences among means are greater than expected from chance alone.
ANOVA or ANCOVA does not, however, proceed to the next logical step of identifying
which differences among the means are significant; this is the task of multiple
comparison techniques.

^a Adapted from a forthcoming article in the Journal of Special Education.

Multiple comparison techniques are a relatively recent development in the
area of statistical analysis which have direct applicability in behavioral
research. Dissemination via applied statistics textbooks has reflected the
expected theory-to-practice lag and, in the main, the information exchange has
been based more on inertia and precedent than on actual research utility.

The lack of systematic textbook coverage of MC methods is illustrated in
Figure 1, which gives the methods covered by popular applied statistics or
experimental design textbooks. Notice that the Scheffe method is the MC
technique most commonly treated, yet it is the least powerful MC procedure for
responding to typical research questions.

Multiple comparisons are a family of techniques that are not closely related
except that they serve a common purpose. This diversity no doubt has contributed to
the uneven textbook coverage. Whereas there is a major pathway that leads the
learner through the analysis of variance, when he encounters the domain of
multiple comparisons, the pathway branches into a network of numerous unmarked
routes. Each MC method has unique advantages and disadvantages. Ideally, the
researcher should be familiar with the major alternatives so that the method
can be selected that yields maximum power for the questions, i.e., so that "the
method of choice" will be chosen. In addition, this information is useful in
interpreting published research.

All too frequently, the MC technique employed in a study is one with which
the researcher is familiar because it "happened" to be treated in the researcher's
favorite reference. As a consequence, inappropriate, weak, or at least
inefficient methods of analysis are frequently used. The conclusions
reached in a given study can vary markedly depending on the MC technique
employed. In the derivations of the MC methods, different assumptions and
restrictions are imposed. As a general rule, the more limitations the researcher
can live with, the more powerful will be the statistical tests for the hypotheses
of interest if the proper MC alternative is chosen.

The differences among the various multiple comparison techniques will be
illustrated from an actual study (Hopkins, 1964) that examined the pattern of
performance of 33 diagnosed neurologically handicapped children (ages 6-12)
on eleven subtests of the Wechsler Intelligence Scale for Children (WISC).
The results of the Subtests-by-Subjects analysis of variance revealed a highly
significant difference among subtest means (F = 32.92/6.99 = 4.71, p < .001).
The subtest means are graphically presented in Figure 2.

To illustrate the great variation in conclusions as a consequence of the
MC technique employed, all possible differences in pairs of means were tested
for significance using the various MC alternatives: multiple t-test, Duncan's
New Multiple Range Test, Newman-Keuls test, Tukey test, Dunn test, Marascuilo
test, and Scheffe test. Of the possible 55 comparisons of pairs of means, the
number of significant mean differences at the .05 and .01 levels for each method
differs greatly, as is shown in Table 1. For example, with α = .05, the number
of null hypotheses rejected varied from 1 using the Scheffe to 24 for the Duncan
and multiple t-tests; with α = .01, the number of significant differences in
means varied from 0 to 15. How can such inconsistency in conclusions result from
the use of alternative MC approaches?

α-Considerations. Even though each method has the same nominal α-value,
not all are appropriate in this situation. Although commonly used, the multiple-t
approach is never the method of choice and cannot be recommended. Multiple
t-tests (also known as the "least significant difference" or LSD procedure)
introduce an inextricable pattern of dependency and yield inaccurate
probability statements regarding the null hypotheses. The inaccuracy grows
with the number of means in the set being examined.
In the present example with 11 means, 55 different t-tests would be required to
test all combinations of pairs. Even if all pair-wise null hypotheses were true,
more likely than not, the lowest vs. the highest mean from the 11 subtests would
yield a t-ratio that would be ruled "significant" at the .05 level.
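
The "more likely than not" claim can be checked with a small simulation. The sketch below is only an illustration under simplifying assumptions that are not part of the original study: the 11 groups are treated as independent (the WISC data are repeated measures), the population variance is taken as known (so a z-test stands in for the t-test), and the function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist


def extreme_pair_rejection_rate(num_groups=11, n_per_group=33, alpha=0.05,
                                trials=2000, seed=0):
    """Estimate how often the largest-vs-smallest mean looks "significant"
    under an unprotected two-group test when every null hypothesis is true.

    Assumptions (illustrative only): independent groups, sigma = 1 known.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    se_mean = (1.0 / n_per_group) ** 0.5          # standard error of one mean
    se_diff = (2.0 / n_per_group) ** 0.5          # standard error of a difference
    rejections = 0
    for _ in range(trials):
        # Sample means are drawn directly; all population means are equal.
        means = [rng.gauss(0.0, se_mean) for _ in range(num_groups)]
        if (max(means) - min(means)) / se_diff > z_crit:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    rate = extreme_pair_rejection_rate()
    print(f"Extreme pair declared significant in {rate:.0%} of null experiments")
```

With only two groups the same function returns a rate near the nominal .05, which is the contrast the paragraph above is drawing.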

The Duncan method has the peculiar property of using a fluctuating α-rate
depending on the number of means in the set being examined. The true probability
of a type-I error (rejecting a true null hypothesis) is always larger than the
tabled α-value except when there are only two means in the set being tested. For
this reason the authors view the Duncan procedure as never the method of choice,
in spite of its popularity. For example, if the Duncan method were used to
test H0: μ_smallest = μ_largest (the smallest and largest means in our sample),
the critical value for α = .05 in Duncan's table will be exceeded 40% of the time
even when the null hypothesis is true (almost as often as with the multiple t
approach)! In other words, the true probability of a type-I error (incorrectly
rejecting a true null hypothesis) is not what most users naturally assume, e.g.,
.05, but much larger: .40 in our example.
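
The .40 figure follows from the protection level built into Duncan's tables, under which the true α for a range spanning r ordered means is 1 - (1 - α)^(r - 1). A minimal sketch of that arithmetic (the function name is hypothetical):

```python
def duncan_true_alpha(nominal_alpha, r):
    """True type-I error rate for Duncan's test on a range spanning r means."""
    return 1 - (1 - nominal_alpha) ** (r - 1)


if __name__ == "__main__":
    for r in (2, 3, 6, 11):
        print(r, round(duncan_true_alpha(0.05, r), 3))
```

For r = 2 the value equals the nominal .05, and for the 11 WISC subtest means it is about .40, matching the text.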

The remaining techniques given in Table 1 have accurate α-values, but α
in relation to what? In the Newman-Keuls method, α is .05 for each individual
null hypothesis (H0) tested, i.e., a contrast-based error rate. In the Dunn,
Dunnett, Tukey, Marascuilo, and Scheffe methods, α is .05 for the entire set
or family of H0's to be tested in the experiment, i.e., an experiment-based
error rate.




Table 1

Number of Significant Differences (of the 55 possible) Between Pairs of
WISC Subtest Means for Various Multiple Comparison Methods

                        Number of H0's Rejected     Percent of H0's Rejected
MC Method               α = .05      α = .01        α = .05      α = .01

Multiple t (LSD)^a        24           15             44%          27%
Duncan^a                  24           11             44%          20%
Newman-Keuls              11            6             20%          11%
Tukey                      9            4             16%           7%
Dunn                       7            4             13%           7%
Marascuilo                 3            1              5%           2%
Scheffe                    1            0              2%           0%

^a For these methods the actual probability of a type-I error is considerably
greater than the tabled, nominal α-value.



The Newman-Keuls method will tend to reject more pair-wise H0's than the
other accurate (with respect to type-I error probabilities) methods because of
its differently based error rate. It should be noted, however, that the critical
value for the Tukey and Newman-Keuls methods will always be equal when testing the
extreme-most means, i.e., when H0: μ_smallest = μ_largest is being tested; hence
they will always lead to the same conclusion for this H0. Thus, although the
Newman-Keuls procedure has a contrast-based error rate, in this limited sense
the Newman-Keuls method has an experiment-based error rate, i.e., it, and the
Tukey method, will be expected to make a type-I error when testing the extreme-most
means in 5% of the experiments in which all pair-wise H0's are true. However, in
these 5% of the experiments, when going on to test other pairs of means the
Newman-Keuls method will tend to make more type-I errors than will the Tukey
procedure.

Of the common procedures, the Scheffe method is the least powerful for
detecting differences between pairs of means. It is best, however, for data
snooping and testing complex hypotheses (hypotheses involving more than two
means). The Marascuilo (1966) method is a rather recently devised MC procedure
appropriate for studies employing large samples. Unlike the others it does not
assume homogeneity of variance, but does require large samples. It is more
useful for making multiple comparisons among correlation coefficients and
proportions.

The relative power and sensitivity of the various MC alternatives are also
illustrated in Figure 3. In this figure, the magnitudes of minimum differences
between the WISC subtest means required for significance for the various MC
methods are graphically depicted. Equivalently, the relative magnitudes for the
associated confidence intervals are illustrated in Figure 3 (except for the
Duncan and Newman-Keuls methods, which do not lend themselves to interval
estimation). For those methods which do not have a single critical value
required for all mean differences, the greatest and least values are given. For
example, for the largest difference between pairs of means (i.e., the difference
between the Arithmetic and Picture Completion means as illustrated in Figure 2),
a value of 2.13 is required to reject the null hypothesis for both the Tukey
and Newman-Keuls methods, yet the latter requires a mean difference of only
1.28 for adjacently ordered means.

Clearly such disparity in results is undesirable, but how does one go about
selecting the optimum procedure for a given research study? Figure 4, which is
a revision of an early schema (Hopkins and Chadbourn, 1967), gives a flow
chart to illustrate the critical decisions leading to the method of choice in
a given research situation.

Criteria for Selecting a Multiple Comparison Method

Since the treatment of multiple comparisons is scattered among many
sources, the flow chart given in Figure 4 is provided to assist the researcher
in the selection of an appropriate method for use in examining differences
between means when more than two groups are involved. In words, the schema
illustrates the following decisions.

1. All methods except Marascuilo's* assume homogeneity of variances; this
assumption should be tested, since unlike ANOVA these procedures do not
appear to be robust to non-homogeneity of variances (Petrinovich and
Hardyck, 1969), especially with unequal sample sizes.

*The large sample method described by Marascuilo (1966) is needed when making
multiple comparisons among correlation coefficients, proportions and contingency
tables, and is recommended for contrasting means only when variances are not
homogeneous.
2. Make planned orthogonal contrasts if they will answer the relevant
hypotheses (usually this will not be the case). Each comparison would have
the contrast as the base for α (see 6 below). The setting of α should not
be arbitrary, but influenced by power considerations (Hopkins, 1972).

3. If all comparisons of interest pit the control group against each of the
other J - 1 groups, use the Dunnett procedure. Using the Dunnett technique,
the probability of a type-I error is α for the set of J - 1 tests, i.e.,
an experiment-based α-value.

4. If the number of comparisons is relatively few (e.g., 2(J - 2) or less),
use the Dunn test. The Dunn test is appropriate for both simple (involving
only two means, i.e., a pair of means) and complex (involving more than
two means) contrasts, and has an experiment-based α-value.

5. Compare the F-ratio for differences among the means (obtained in the ANOVA
or ANCOVA) with the critical value required for significance. If H0 cannot be
rejected, one probably should not look further for mean differences,
although this is a logical rather than a purely statistical consideration.
If the omnibus F is not significant, it is tantamount to concluding that all
differences among all means are attributable to random sampling error.

6. Select the base of α (contrast or experiment). The Tukey, Scheffe, Dunn,
Dunnett, and Marascuilo MC tests use the experiment as base, hence a type-I
error will be made in only 5% of the experiments (if α = .05). The
Newman-Keuls method employs the comparison as the unit; therefore, a type-I
error can be expected for 5% of the contrasts. This is equivalent to
saying more type-I errors for differences between pairs of means will be
made with the Newman-Keuls procedure, but fewer type-II errors than with
the experiment-based methods. Hence, if only pair-wise comparisons are
involved and the contrast is the base for α, use the Newman-Keuls
method. (For unequal sample sizes for the Newman-Keuls or Tukey
methods see Steel and Torrie, 1960, p. 114 or Fryer, 1966, p. 274).

7. If the number of hypotheses to be tested is less than J(J - 1)/4, the
Dunn (1961) method will usually be more powerful than either the Tukey
or the Scheffe method. (Tables 4-6 in Dunn's (1961) article provide
precise figures for various J, α, and df_e combinations for which the Dunn
method would be more powerful). The special tables of critical values for
the Dunn test are available in Dunn (1961), Miller (1966), Kirk (1968),
and Myers (1972). If all J(J - 1)/2 pairwise comparisons are of interest,
as is usually the case, the Tukey method should be used since it is more
powerful than the Dunn and Scheffe methods under such conditions (Scheffe,
1959, p. 76).

8. If comparisons between complex combinations of means are desired, the
Scheffe method has more power than the Tukey.
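
The eight decisions above can be read as a short decision procedure. The sketch below is one hypothetical encoding of it, not the authors' Figure 4: the function name, parameter names, and the order of the checks are assumptions, the Dunn cutoff uses the J(J - 1)/4 rule from decision 7, and the omnibus F check of decision 5 is left to the caller.

```python
def choose_mc_method(statistic="means", variances_homogeneous=True,
                     planned_orthogonal=False, control_vs_each=False,
                     num_groups=None, num_contrasts=None,
                     complex_contrasts=False, error_base="experiment"):
    """Return the name of the MC method suggested by the decision list.

    All parameters are illustrative stand-ins for the questions in the list.
    """
    # Decision 1: Marascuilo for non-means or heterogeneous variances.
    if statistic != "means" or not variances_homogeneous:
        return "Marascuilo"
    # Decision 2: planned orthogonal contrasts, when they answer the question.
    if planned_orthogonal:
        return "Planned orthogonal contrasts"
    # Decision 3: each treatment group against a single control group.
    if control_vs_each:
        return "Dunnett"
    # Decisions 4 and 7: relatively few planned contrasts favor Dunn.
    if num_groups is not None and num_contrasts is not None:
        if num_contrasts < num_groups * (num_groups - 1) / 4:
            return "Dunn"
    # Decision 8: complex (more than two means) post hoc contrasts.
    if complex_contrasts:
        return "Scheffe"
    # Decision 6: pairwise contrasts; the base of alpha decides.
    if error_base == "contrast":
        return "Newman-Keuls"
    return "Tukey"


if __name__ == "__main__":
    print(choose_mc_method(num_groups=10, num_contrasts=12))   # few contrasts
    print(choose_mc_method(error_base="contrast"))             # all pairwise
```

A function like this is only a memory aid; the power comparisons cited in decision 7 still need to be consulted for borderline numbers of contrasts.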

The most rigorous and comprehensive treatment of the statistical
properties underlying multiple comparison procedures is found in Miller (1966).
The reader will find quite complete treatments in Kirk (1968) and Winer (1971).
Articles by Duncan (1965) and Sparks (1963) provide useful computational
comparisons. If one is doing multiple comparisons following an ANCOVA it is
important to remember that adjustments must be made in the mean square
error term (e.g., see Winer, 1971, p. 772).

The purpose of this article was to illustrate the importance of selecting
the appropriate statistical model that best fits the experimental methods and
hypotheses of interest. The schema provided was designed to encourage the reader
to consider critical factors that will determine the selection of the optimum
multiple comparison method for the hypotheses of interest in a given study.

2 Duncan's New Multiple Range Test is not included here since the experimenter-
selected α-value is correct only for adjacently-ordered means; the actual α-value
always exceeds the selected value in all other contrasts (cf. Edwards, 1968, p.
134-135). In addition, mathematical statisticians are not in agreement regarding
the validity of certain assumptions employed in its derivation (Scheffe, 1959,
p. 78; Duncan, 1965, p. 178).
Instructional Exercises

Which multiple comparison technique is preferable:

1. for testing several correlation coefficients for significant differences?

Marascuilo

2. for comparing each of several means with the mean of the control group?

Dunnett

3. when, although there are ten treatment groups, only twelve hypotheses
are to be tested?

Dunn

4. for making all possible pairwise contrasts among means with a contrast-
based error rate?

Newman-Keuls

5. for making all possible pairwise contrasts among means with an experiment-
based error rate?

Tukey

6. for data snooping -- making post hoc complex contrasts involving means?

Scheffe

7. for comparing means when variances are extremely heterogeneous?

Marascuilo
II. Multiple Comparisons Computation

Multiple comparisons are a loosely-related family of techniques for identifying
significant differences among a set of three or more means. There are eight
principal methods but only three different computational procedures: three employ
the t-test (multiple t, Dunn (or Bonferroni), and Dunnett); three use the
studentized range statistic, q (Tukey, Newman-Keuls, and Duncan); and two
employ the F-statistic (Scheffe and planned orthogonal contrasts). In the
discussion to follow, it is assumed that the usual ANOVA assumptions hold and
that all means are based on the same number (n) of observations.

The t-Statistic Approaches

The multiple t, Dunn, and Dunnett methods are computationally identical
(for a given H0), except that the critical t-values required to reject H0 will
differ. The amount of difference is highly related to the number of means,
J, being compared. (If J = 2, all methods give identical results, but, of course,
are unnecessary.)

Suppose there are six groups of 11 subjects each that are compared on
some measure. The analysis of variance (ANOVA) revealed that
H0: μ1 = μ2 = ... = μ6 is not tenable, and hence rejected. The ANOVA table is
given below:

Source of Variation     df      MS      F

Treatments               5     176     8.0
Error                   60      22

But which H0: μ_i = μ_j are tenable? To test each H0, compute the t-ratio,

\[ t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{2\,MS_e / n}} \]

For simplicity, select \bar{X}_i to be larger than \bar{X}_j. MS_e is the error
term from the analysis of variance and n is the number of observations on which
each mean is based.

In this example:

\[ t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{2(22)/11}}
     = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{4}}
     = \frac{\bar{X}_i - \bar{X}_j}{2} \]
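
As a quick check on the arithmetic, the t-ratio for the example can be sketched in a few lines. The function name and the two group means used in the demonstration are hypothetical; only MS_e = 22 and n = 11 come from the ANOVA table above.

```python
import math


def t_ratio(mean_i, mean_j, ms_error, n):
    """t-ratio for a pairwise contrast with equal group sizes n."""
    standard_error = math.sqrt(2 * ms_error / n)  # equals 2.0 in the example
    return (mean_i - mean_j) / standard_error


if __name__ == "__main__":
    # Hypothetical means of 30 and 24 give a difference of 6 and t = 3.0.
    print(t_ratio(30.0, 24.0, ms_error=22.0, n=11))
```

Because the denominator is 2 in this example, every t-ratio is simply half the difference between the two means.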



Is the t-value large enough to be significant, i.e., to reject H0? The
non-directional H0 is rejected at the α-level if the observed t-value exceeds
the critical t-values shown below.
                      For Multiple-t        For Dunn^a              For Dunnett

General expression:   (1-α/2) t (f_e)       (1-α) t (c, f_e)        (1-α/2) t (J, f_e)

or, in our example
with α = .05:         .975 t (60) = 2.00    .95 t (5, 60) = 2.66    .975 t (6, 60) = 2.63
                                            .95 t (15, 60) = 3.06

Number of H0's
to be tested:         J(J - 1)/2            c                       J - 1

^a (Note: Dunn's table presupposes that the smaller value is subtracted from the
larger, hence the tabled .95 values are actually the .975 point in the cumulative
distribution.)



Although the three approaches arrive at identical t-values for a given
contrast, they will usually differ greatly in the critical t-values needed to
reject H0. The multiple-t, although widely used, will result in many type-I
errors (i.e., rejecting true H0's) and is never recommended as the method of
choice.

The Dunnett is appropriate only when one wishes to compare each of the
J - 1 groups with one other predesignated group, usually the control group.
In most instances the researcher wishes to compare each mean with every other mean,
hence the Dunnett rarely addresses many of the investigator's questions.

The Dunn test requires that the researcher have planned in advance which
comparisons he is going to make. The number of these planned contrasts, c,
affects the critical t-value; as would be expected, the larger the value of
c, the larger the critical value of t. In our example, the critical t-values
are 2.66 and 3.06 for c = 5 and c = 15 respectively. If the researcher wishes
to test all J(J - 1)/2 pairwise comparisons (15 in our example), the Dunn
procedure is not as powerful as other alternatives to be considered later.
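
Why the Dunn critical value grows with c can be seen from the Bonferroni adjustment on which the test rests: the experiment-wise α is divided equally among the c planned contrasts. The sketch below shows only that bookkeeping; the function names are hypothetical, and turning the resulting percentile into a critical t still requires a t table (or Dunn's own tables) with the error degrees of freedom.

```python
def dunn_per_contrast_alpha(alpha, c):
    """Two-tailed alpha allotted to each of c planned contrasts."""
    return alpha / c


def dunn_cumulative_point(alpha, c):
    """Percentile of the t distribution to look up for a two-tailed test."""
    return 1 - alpha / (2 * c)


if __name__ == "__main__":
    for c in (1, 5, 15):
        print(c, dunn_per_contrast_alpha(0.05, c),
              round(dunn_cumulative_point(0.05, c), 5))
```

With c = 1 the percentile is the ordinary .975 point used by the multiple-t; with c = 5 it rises to the .995 point, which with 60 degrees of freedom corresponds to the 2.66 quoted above.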

Studentized Range (q) Methods

The Tukey, Newman-Keuls, and Duncan methods are computationally identical
except that the critical q-values for a given H0 will usually differ. The
studentized range statistic, q, is:

\[ q = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{MS_e / n}} \]

In the example:

\[ q = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{22/11}}
     = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{2}} \]



Is the q-value large enough to be significant, i.e., to reject H0? The
non-directional H0: μ_i = μ_j is rejected at the α-level if the observed q-value
exceeds the critical q-values given below.

                      For Tukey              For Newman-Keuls        For Duncan

General expression:   (1-α) q (J, f_e)       (1-α) q (r, f_e)        "(1-α)" q (r, f_e)

                                             where r is the number of means in
                                             the subset being evaluated

or, in our example
with α = .05:         .95 q (6, 60) = 4.16   .95 q (6, 60) = 4.16    ".95" q (6, 60) = 3.19
                                             .95 q (5, 60) = 3.98    ".95" q (5, 60) = 3.14
                                             .95 q (4, 60) = 3.74    ".95" q (4, 60) = 3.07
                                             .95 q (3, 60) = 3.40    ".95" q (3, 60) = 2.98
                                             .95 q (2, 60) = 2.83    ".95" q (2, 60) = 2.83

In the Tukey method, the critical value for q is constant for all H0's in the set.

In the Newman-Keuls method, the largest \bar{X}_i - \bar{X}_j is tested first, hence
there are J means being considered and the critical q-value is identical with
that of the Tukey. If that is significant, the researcher proceeds to test
the next largest mean difference, in which r = J - 1 and the critical q-value
is (1-α) q (J - 1, f_e), which is smaller than when r = J. This procedure is
continued until the investigator finds the largest mean difference in the subset
being examined to be non-significant, at which time he does not continue
testing further among the means contained in that particular non-significant
subset of means.
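
The stepwise logic just described can be sketched as follows. This is an illustrative implementation under the module's assumption of equal n, not code from the original: the function name and the four group means in the demonstration are hypothetical, while the critical q-values are the tabled α = .05 values for 60 error degrees of freedom given above.

```python
import math


def newman_keuls(means, ms_error, n, q_crit):
    """Return the set of (low_group, high_group) index pairs judged significant.

    q_crit maps r (number of ordered means spanned) to the critical q-value.
    """
    order = sorted(range(len(means)), key=lambda g: means[g])
    standard_error = math.sqrt(ms_error / n)
    significant = set()
    nonsignificant_spans = []  # (lo, hi) positions in the ordered list
    k = len(order)
    for r in range(k, 1, -1):              # widest range first
        for lo in range(0, k - r + 1):
            hi = lo + r - 1
            # Do not test inside a subset already declared non-significant.
            if any(a <= lo and hi <= b for a, b in nonsignificant_spans):
                continue
            q = (means[order[hi]] - means[order[lo]]) / standard_error
            if q > q_crit[r]:
                significant.add((order[lo], order[hi]))
            else:
                nonsignificant_spans.append((lo, hi))
    return significant


if __name__ == "__main__":
    q_05_df60 = {2: 2.83, 3: 3.40, 4: 3.74, 5: 3.98, 6: 4.16}
    hypothetical_means = [0.0, 1.0, 2.0, 10.0]
    print(sorted(newman_keuls(hypothetical_means, 22.0, 11, q_05_df60)))
    # prints [(0, 3), (1, 3), (2, 3)]
```

In the demonstration the first three groups form a non-significant subset, so the pairs inside it are never tested, exactly as the stopping rule requires.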

The Duncan Multiple Range test is a procedure identical with the
Newman-Keuls except that the true α is always greater than the tabled value
(except when r = 2). This fluctuation in the true α-value is a feature most
consider to be undesirable. For this reason, many authorities never consider
Duncan to be the "method of choice."

The F-Distribution Methods

The Scheffe and Planned Orthogonal Contrast (POC) methods are computationally
identical for a given H0, but differ in the critical F-value needed to reject
H0. Both estimate a contrast, ψ, by the expression:

\[ \hat{\psi} = C_1\bar{X}_1 + C_2\bar{X}_2 + \cdots + C_J\bar{X}_J \]
The H0 being tested is determined by the values the researcher selects for
the C coefficients. Meaningful contrasts require the C's to sum to zero.
For example, to test H0: μ1 = μ2, C1 and C2 will be 1 and -1. For all pairwise
contrasts, the C-values for the two groups will be 1 and -1. (The values for
complex contrasts (comparisons involving three or more groups) are not
considered in this section). The sum of squares (SS) and the mean square for
the contrast (since each contrast has one degree of freedom) is:

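\[ SS_{\psi} = MS_{\psi} =
   \frac{\hat{\psi}^2}{\dfrac{C_1^2}{n_1} + \dfrac{C_2^2}{n_2} + \cdots + \dfrac{C_J^2}{n_J}} \]

or for pairwise contrasts with equal n's

\[ SS_{\psi} = MS_{\psi} = \frac{(\bar{X}_i - \bar{X}_j)^2}{2/n} \]

which is used to test H0: μ_i = μ_j, i.e.,

\[ F = \frac{MS_{\psi}}{MS_e} = \frac{n(\bar{X}_i - \bar{X}_j)^2}{2\,MS_e} \]

(Note: F = t^2 = q^2/2 for a given comparison)

Is the F-value large enough to be significant, i.e., to reject H0? The
non-directional H0: μ_i = μ_j is rejected at the α-level if the obtained F-ratio
exceeds the critical values shown below: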



                      For Scheffe                                  For POC

General expression:   (J - 1) (1-α) F (J - 1, f_e)                 (1-α) F (1, f_e)

or, in our example
with α = .05:         5 [.95 F (5, 60)] = 5(2.37) = 11.85          .95 F (1, 60) = 4.00



The critical value for the Scheffe test will usually be much larger than
the corresponding value for POC. (In this case 11.85 vs. 4.00, or almost 3 times
larger than for POC). However, POC can test only J - 1 H0's whereas Scheffe can
be used for any number of conceivable H0's. In addition, like the Dunn test,
the POC requires that the H0's to be tested be specified prior to the
analysis. The J - 1 comparisons must also be orthogonal (i.e., independent).
Contrasts will be orthogonal only when the products of the corresponding
C's for the two contrasts, ψ_a and ψ_b, sum to zero, i.e.

\[ C_{1a}C_{1b} + C_{2a}C_{2b} + \cdots + C_{Ja}C_{Jb} = 0 \]
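
The two conditions on the C coefficients (each set sums to zero, and the cross-products of two sets sum to zero) are easy to check mechanically. A minimal sketch, assuming equal group sizes as the module does; the function names and the example coefficient sets are hypothetical.

```python
def is_contrast(coefficients, tol=1e-9):
    """A meaningful contrast has coefficients that sum to zero."""
    return abs(sum(coefficients)) < tol


def are_orthogonal(coefficients_a, coefficients_b, tol=1e-9):
    """Two contrasts are orthogonal when their cross-products sum to zero."""
    products = (a * b for a, b in zip(coefficients_a, coefficients_b))
    return abs(sum(products)) < tol


if __name__ == "__main__":
    psi_a = [1, -1, 0]    # group 1 vs. group 2
    psi_b = [1, 1, -2]    # groups 1 and 2 together vs. group 3
    psi_c = [1, 0, -1]    # group 1 vs. group 3
    print(are_orthogonal(psi_a, psi_b))  # cross-products: 1 - 1 + 0 = 0
    print(are_orthogonal(psi_a, psi_c))  # cross-products: 1 + 0 + 0 = 1
```

The second pair shows why the full set of pairwise contrasts can never all be planned orthogonal contrasts: any two pairwise contrasts sharing a group fail the cross-product condition.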

Comparisons Among Methods for our Example

Although the degree of difference between the methods will vary, depending
on J and n, the rank order of the magnitude of the differences in means
needed to reject H0: μ_i = μ_j is predictable (except when J = 2, when all
methods give identical results). Table 2 gives the magnitude of
\bar{X}_i - \bar{X}_j needed to reject H0 (in \sqrt{2 MS_e / n} units) for the
various r-values and the associated number of H0's to be tested, c.


Table 2

A Comparison of Mean Differences Needed to Reject H0 for
Pairwise Contrasts for Various Multiple Comparison Methods
(in s_(X̄j - X̄j') units) when J = 6 and n = 11

                        r = number of means in subset
                              being examined
Method           c        2       3       4       5

Multiple-t      15      2.00    2.00    2.00    2.00
Dunnett          5      2.63    2.63    2.63    2.63
Dunn             5      2.66    2.66    2.66    2.66
Dunn            15      3.06    3.06    3.06    3.06
Tukey           15      2.94    2.94    2.94    2.94
Newman-Keuls    15      2.00    2.40    2.64    2.81
Duncan          15      2.00    2.11    2.17    2.22
Scheffe         15      3.44    3.44    3.44    3.44
POC              5      2.00    2.00    2.00    2.00

c = number of pairwise H0's tested
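Two rows of Table 2 can be reproduced from the critical F-values quoted earlier for this example. This is a sketch under the assumption that the table's unit is the standard error of the difference between two means, so that the critical ratio is the square root of the critical F (since F = t^2 for a single contrast); the variable names are illustrative.

```python
import math

J = 6               # number of groups in the Table 2 example
f_scheffe = 2.37    # .95F(5, 60), as quoted in the text
f_poc = 4.00        # .95F(1, 60), as quoted in the text

# Scheffe: critical F is (J - 1) times the tabled F; take the square root
# to express it as a mean difference in standard-error units.
scheffe_units = math.sqrt((J - 1) * f_scheffe)

# POC: the tabled F with 1 df in the numerator is used directly.
poc_units = math.sqrt(f_poc)

print(round(scheffe_units, 2))  # Scheffe row of Table 2
print(round(poc_units, 2))      # POC row of Table 2
```

The results, 3.44 and 2.00, agree with the Scheffe and POC rows of the table.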

There are other important ways in which these methods differ, one of which
is the basis for the α-error rate. Dunn, Dunnett, Tukey and Scheffe use the
entire experiment as the basis for the type I error rate; hence if α = .05, on the
average the investigator will make a type I error in only 5% of the experiments
he conducts. Newman-Keuls and POC use the individual contrast or H0 as the basis
for α; hence the investigator will make a type I error in 5% of the H0's he tests.
Other differences are summarized in Table 3.
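The practical gap between a contrast-based and an experiment-based error rate can be illustrated numerically. The sketch below uses the standard relation 1 - (1 - α)^c, which holds only under the assumption that the c tests are independent; pairwise comparisons sharing the same means are not independent, so the figures are a rough guide rather than exact rates. The function name is hypothetical.

```python
def experimentwise_rate(alpha, c):
    """Chance of at least one type I error in c independent tests at level alpha."""
    return 1 - (1 - alpha) ** c


# c values chosen to match the numbers of hypotheses that appear in Table 2.
for c in (1, 5, 15):
    print(c, round(experimentwise_rate(0.05, c), 3))
```

With α = .05 per contrast, the chance of at least one false rejection is about .23 for 5 tests and about .54 for 15 tests under this independence assumption.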
Depicting Multiple Comparison Results

There are J(J - 1)/2 possible pairwise comparisons. If J is large, for
example, 10, then there are 45 hypotheses to be considered. A parsimonious method
of depicting the results is commonly used: the underscoring procedure. The
groups are arranged in order of their means, from low to high. Then each
nonsignificant subgroup is underscored; any two means underscored by a common
line do not differ significantly. Note the example below:

                 Group
        1    2    3    4    5
        ___________
                  ___________

Group 5 differs significantly from groups 1 and 2.

Group 4 differs significantly from groups 1 and 2.

Group 3 does not differ significantly from any group. 

Group 2 differs significantly from groups 4 and 5. 

Group 1 differs significantly from groups 4 and 5. 
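Reading an underscore diagram amounts to listing every pair of groups that shares no common line. A minimal sketch follows; the two underlines (groups 1-3 and groups 3-5) are an assumption chosen to be consistent with the five statements above, and the function name is hypothetical.

```python
from itertools import combinations


def differing_pairs(groups, underlines):
    """Return the pairs of groups that are not joined by any common underline."""
    pairs = []
    for a, b in combinations(groups, 2):
        joined = any(a in line and b in line for line in underlines)
        if not joined:
            pairs.append((a, b))
    return pairs


groups = [1, 2, 3, 4, 5]
underlines = [{1, 2, 3}, {3, 4, 5}]   # assumed underscoring for the example

n_pairs = len(groups) * (len(groups) - 1) // 2   # J(J - 1)/2 pairwise comparisons
print(n_pairs)
print(differing_pairs(groups, underlines))
```

Of the 10 possible pairs, four differ significantly: (1, 4), (1, 5), (2, 4) and (2, 5). Group 3 shares a line with every other group, which matches the statement that it differs from none.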



Mastery Test: Multiple-Comparison Contrasts

In a study at Cornell University, objectivity ratings given to 10 publications
were compared. The dependent variables, whose means are shown in the table
below, represent ratings totaled over 1J well-defined aspects of objectivity,
4 raters and 3 news events (topics).



Overall Objectivity of All Publications on All Topics

[Table 1: mean objectivity ratings for Periodicals #1 through #10, ordered from Most Objective to Least Objective. The periodical names and the underscoring of nonsignificant subsets are not recoverable from the source; the means that remain legible are 1.76, 1.99, 2.10, 2.38, 2.44, 2.89 and 3.36.]



1. According to the results in Table 1, Time (#4) is significantly
   a. more objective than publication #'s ____.
   b. less objective than publication #'s ____.

2. According to the results in Table 1, Our Times is significantly
   a. more objective than publication #'s ____.
   b. less objective than publication #'s ____.

3. The statistical techniques used to obtain the results shown in the Table
   were that of
   a. planned orthogonal comparisons.   b. post-hoc comparisons.

4. Had multiple t-tests been used to compare all the possible pairwise
   differences, would more "significant" comparisons have resulted?

5. In a given experiment with several groups, which multiple comparison
   method will require the largest difference between pairs of means in
   order to reject the null hypothesis, and hence signify the fewest
   significant differences?
   a. Tukey   b. Scheffe   c. Newman-Keuls

Answers: 1(a): 7-10; 1(b): 1, 2; 2(a): 10; 2(b): 1-8; 3: b; 4: yes; 5: b
Multiple Comparisons -- Problem Sets and Notes

Given one fixed ANOVA factor with five treatment levels

                 Groups
          1     2     3     4     5
X̄_j:     37    43    51    54    65
n:        9     9     9     9     9

An ANOVA Summary table is given below

Source of
Variation        SS      df     MS     F

Treatments     £5215      4
Error           4000     40    100

.99F(4,40) = 3.83

1. How many planned orthogonal contrasts (POC) are possible?

2. Could you have legitimately inspected the means prior to
   your selection of the orthogonal contrasts of interest?

Assume the following definition of the five randomly-
assigned groups.

                                                     Abbreviation
   1. Control                                           (C)
   2. infrequently tested pupils without feedback       (I,no)
   3. frequently tested pupils with negative feedback   (F,-)
   4. infrequently tested pupils with positive feedback (I,+)
   5. frequently tested pupils with positive feedback   (F,+)

SUGGESTION: Use a separate piece of paper to do calculations. Place it
over the right-hand side of these sheets, where the answers are given.

3. The means for the five groups are given below. Suppose you
   wished to test H0: μ3 = μ5. Enter the coefficients
   for this contrast (ψ1).

                       Group
            C    I,no   F,-   I,+   F,+
            1     2      3     4     5
   X̄_j:    37    43     51    54    65
   c_j:    __    __     __    __    __

4. Suppose you also had good reason to test H0: [illegible in source].
   Would this be orthogonal with ψ1?

5. Why?

ANSWERS

1. 4

2. No, a priori rationale would no longer apply.

3. 0, 0, 1, 0, -1 (or 0, 0, -1, 0, 1), i.e., ψ1 = μ3 - μ5 = 0

4. No

5. Σcc ≠ 0; the sum of products of the respective
   coefficients for the groups must be zero.
6. In addition to ψ1, indicate coefficients for the contrast
   for frequently and infrequently tested groups (ψ3).

7. What is H0 for ψ4, which has 4, -1, -1, -1, -1 as
   coefficients?

8. Is ψ4 orthogonal with ψ1? ... with ψ3?

9. Compute MS_ψ1:  SS_ψ1 = ψ̂1^2 / (Σc^2 / n) = MS_ψ1, since each
   contrast has df = 1.

10. Compute F for the contrast ψ1.

11. For planned orthogonal contrasts (POC) the critical
    value, .95F(1,40), in the sample problem is ____.

12. Is H0 rejected with α = .05?

13. Would you recommend the POC as the multiple comparison
    technique in this example?

14. Why?

15. In the Scheffe method (S-method), would you use
    the identical procedure as that for POC for obtaining
    a. SS_ψ?
    b. MS for the contrasts?
    c. Obtained F for the contrast?
    d. The "critical" value for F?

ANSWERS

6. 0, 1, -1, 1, -1 (or 0, -1, 1, -1, 1)

7. H0: μ1 = (μ2 + μ3 + μ4 + μ5)/4
   or 4μ1 = μ2 + μ3 + μ4 + μ5

8. yes, Σcc = 0;
   yes, Σcc = 0

9. MS_ψ1 = 882

10. F = MS_ψ1 / MS_e = 882/100 = 8.82

11. .95F(1,40) = 4.08

12. .95F = 4.08 < 8.82; yes, H0 rejected

13. probably not

14. Only selected contrasts of interest could be
    legitimately evaluated.

15. a. Yes
    b. Yes
    c. Yes
    d. No, critical F is (J-1)[1-αF(J-1, f_e)] for Scheffe,
       but 1-αF(1, f_e) for POC.
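Items 9 through 12 can be traced with a few lines of arithmetic. The sketch below assumes equal group sizes and uses only the values the problem supplies for the contrast ψ1 (means of 51 and 65 for groups 3 and 5, n = 9, error mean square 100, critical F of 4.08); the function and variable names are illustrative.

```python
def contrast_ss(means, coeffs, n):
    """Sum of squares for a contrast with equal n: n * psi_hat^2 / sum(c^2)."""
    psi_hat = sum(coeffs[g] * means[g] for g in coeffs)
    return n * psi_hat ** 2 / sum(c * c for c in coeffs.values())


# Only groups with nonzero coefficients enter the contrast psi_1 = mu_3 - mu_5.
means = {3: 51, 5: 65}
coeffs = {3: 1, 5: -1}

ss_psi = contrast_ss(means, coeffs, n=9)   # equals MS for the contrast (df = 1)
ms_error = 100                             # from the ANOVA summary table
f_obs = ss_psi / ms_error
f_crit_poc = 4.08                          # .95F(1, 40), as given in item 11

print(ss_psi, f_obs, f_obs > f_crit_poc)
```

The contrast mean square is 882 and the obtained F is 8.82, which exceeds 4.08, so H0 is rejected under the planned orthogonal contrast criterion.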
16. For the same H0 using the Scheffe method the critical F-
    value is (J-1)(.95F(J-1, f_e)) or ( )(.95F(4,40)) =
    ( )( ) = 10.44.

17. How does this critical value compare with that for
    orthogonal comparisons, assuming the contrast had been
    planned?

18. Does the S-method require orthogonality?

19. Does it give a contrast-based type I error rate?

20. For the POC would the critical F-value (4.08) be the
    same for all four (J - 1) possible comparisons?

21. Would the critical value for F with the S-method also
    be constant for all of the possible contrasts?

22. What is the probability that one or more of the
    comparisons will yield an F > (J-1)(.95F(J-1, f_e)) when
    H0 is true?

23. Had the experimenters selected the Tukey method, the
    distribution theory is no longer based on the F-model
    but on the ____.

24. Unlike the t, which uses t = (X̄1 - X̄2)/s_(X̄1-X̄2)
    as the critical comparison, Tukey and
    Newman-Keuls use q = (X̄1 - X̄2)/s_X̄ as the critical ratio on
    which the distribution theory is based. We should
    expect, then, that when J = 2, since
    s_(X̄1-X̄2) = s_X̄ √2, that t = ____.

25. You recall that when J = 2, F = t^2, or t = ____.

26. For the Tukey method the critical value (α = .05)
    for each comparison would be .95q(J, f_e) or .95q(5,40) = ____.

ANSWERS

16. 4(.95F(4,40)) = 4(2.61)

17. It is much larger, 10.44 vs. 4.08.

18. No, any conceivable contrast is allowable.

19. No, an experiment-wise rate

20. Yes

21. Yes

22. .05

23. studentized range

24. t = q/√2 or q = t√2

25. t = √F

26. .95q(5,40) = 4.04
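The Scheffe critical value in item 16 and the relations among t, F and q in items 24 and 25 can be verified with the tabled values the text supplies. This is a sketch; the variable names are illustrative, and the J = 2 relations are applied to the tabled .95F(1,40) purely as a numerical check.

```python
import math

J = 5
f_tab_scheffe = 2.61   # .95F(4, 40), from the answer to item 16
f_tab_poc = 4.08       # .95F(1, 40), from item 11
f_obs = 8.82           # obtained F for the contrast psi_1 (item 10)

# Scheffe: multiply the tabled F by (J - 1).
scheffe_crit = (J - 1) * f_tab_scheffe

# For a two-mean comparison: t = sqrt(F) and q = t * sqrt(2).
t_crit = math.sqrt(f_tab_poc)
q_crit = t_crit * math.sqrt(2)

print(round(scheffe_crit, 2), f_obs > scheffe_crit)
print(round(t_crit, 2), round(q_crit, 2))
```

The Scheffe criterion is 10.44, so the obtained F of 8.82 that was significant as a planned contrast is not significant by the S-method. The derived q of about 2.86 agrees with the tabled studentized range value for two means and 40 df used later for the Newman-Keuls method.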
27. Since q_r = (X̄_j - X̄_j')/s_X̄, a value of s_X̄ is required that is
    independent of treatment effects. Recall the symbol
    "MS" is another symbol for s^2 and that s_X̄ and s are
    related as shown in the equation from elementary
    statistics: ____.

28. The n in the above equation is the number of subjects in
    each group or level, in this case ____.

29. Therefore, s_X̄ = √(MS_e/n) = ____ or ____.

30. Since s_X̄ and the critical q-value are the same for all
    Tukey multiple comparisons, the equation can be
    rearranged so that the minimum significant differences
    (designated "honest significant differences," HSD)
    between a pair of means, HSD = .95q(J, f_e) s_X̄ = ( )( ) =
    13.45.

31. Therefore, in using the Tukey method, every difference
    between pairs of means greater than 13.45 would be judged
    significant, and H0 rejected at the ____ level.

The treatment means and the matrix of pairwise differences
between treatment means are given below:

   Group:    1     2     3     4     5
   Mean:    37    43    51    54    65

[Matrix of mean differences: only part of it is legible in the source. The entries that can be read and checked against the means are X̄4 - X̄3 = 3, X̄5 - X̄3 = 14, and X̄5 - X̄4 = 11.]

32. For which differences would H0 be rejected at .05 using
    the Tukey (HSD) method?

33. Did the S-method reject H0: μ5 = μ3?

ANSWERS

27. s_X̄ = s/√n

28. 9

29. √(100/9) = √11.11 or 3.33

30. (4.04)(3.33)

31. .05

32. Reject μ1 = μ5, μ2 = μ5,
    μ3 = μ5

33. No (cf. items 9, 10, 16;
    F = 8.82 < 10.44)
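The HSD computation in items 29 and 30 can be sketched as follows, using the error mean square (100), group size (9) and tabled q (4.04) given in the problem. The text rounds s_X̄ to 3.33 before multiplying, so the sketch does the same to reproduce 13.45; the two differences tested are the ones that can be read directly from the means of groups 3, 4 and 5. Variable names are illustrative.

```python
import math

ms_error = 100
n = 9
q_crit = 4.04                        # .95q(5, 40), from item 26

s_mean = math.sqrt(ms_error / n)     # standard error of a mean
s_mean_rounded = round(s_mean, 2)    # the text carries 3.33
hsd = q_crit * s_mean_rounded        # honest significant difference

diff_3_vs_5 = 65 - 51
diff_4_vs_5 = 65 - 54

print(s_mean_rounded, round(hsd, 2))
print(diff_3_vs_5 > hsd, diff_4_vs_5 > hsd)
```

A difference of 14 exceeds the HSD of 13.45, so H0: μ3 = μ5 is rejected by the Tukey method, whereas a difference of 11 does not reach it. Without the intermediate rounding the product is about 13.47, which leads to the same decisions here.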
34. The latter computation illustrates the typical relation-
    ship between the power of the T and S methods. In
    comparing pairs of means the ____ method is more
    efficient and powerful. On the other hand, the S-
    method is more sensitive in evaluating complex hypo-
    theses, e.g., an H0 of the form μ_a = (μ_b + μ_c + μ_d)/3.

35. The Newman-Keuls (N-K) method, unlike the T and S
    methods, but like the orthogonal contrasts, bases its
    type I error rate on the individual comparison rather
    than on the ____.

36. The N-K always has J - 1 different critical values
    for q, or equivalently minimum critical mean
    differences. Critical q-values are given in tables
    of the studentized range statistic.

37. q_5(5,40) where the means are 5 steps apart = ____
    q_4(4,40) where the means are 4 steps apart = ____
    q_3(3,40) where the means are 3 steps apart = ____
    q_2(2,40) where the means are 2 steps apart = ____

38. The minimum difference for r = 5, the extreme-most
    means, then is identical with that for the ____
    method. This is always the case.

39. Therefore, for
    r = 5, minimum mean difference = 13.45
    r = 4, minimum mean difference = (3.79)( ) = 12.62
    r = 3, minimum mean difference = (3.44)(3.33) = 11.46
    r = 2, minimum mean difference = (2.86)(3.33) = 9.52

40. Which H0 was rejected for N-K that was not with the
    T-method? H0: ____

41. Complete the summary figure (any two means not
    underlined by the same line differ significantly
    at the .05 level).

                    Treatments
                 1   2   3   4   5
    S-method
    T-method
    N-K method

ANSWERS

34. Tukey

35. experiment

36. J - 1

37. 4.04
    3.79
    3.44
    2.86

38. Tukey

39. 3.33

40. μ4 = μ5

41. [Completed underscoring for the S-, T-, and N-K methods is not legible in the source.]
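The Newman-Keuls minimum differences in item 39 follow from multiplying each tabled q by s_X̄. The sketch below uses the q-values and the rounded s_X̄ = 3.33 given in the text, and then checks the single comparison named in item 40. It deliberately omits the full stepwise N-K logic (testing the widest range first and stopping within nonsignificant ranges), so it is an illustration of the critical values only; the names are hypothetical.

```python
s_mean = 3.33                                     # s of a mean, from item 29
q_by_r = {5: 4.04, 4: 3.79, 3: 3.44, 2: 2.86}     # tabled q for r means, 40 df

# One minimum significant difference per number of steps r.
min_diff = {r: round(q * s_mean, 2) for r, q in q_by_r.items()}
print(min_diff)

# Groups 4 and 5 are adjacent in rank (r = 2); their means differ by 65 - 54.
diff_4_vs_5 = 65 - 54
rejected_by_nk = diff_4_vs_5 > min_diff[2]
rejected_by_tukey = diff_4_vs_5 > min_diff[5]     # Tukey uses the r = J value throughout
print(rejected_by_nk, rejected_by_tukey)
```

The difference of 11 exceeds the r = 2 criterion of 9.52 but not the Tukey criterion of 13.45, which is why H0: μ4 = μ5 is rejected by N-K and retained by the T-method.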
42. The Dunnett uses the ____ as the base for α-error.

43. One group (usually the ____) is compared with each
    and every other group (or level).

44. In essence the t-ratio is computed, i.e.,
    t = (X̄_E - X̄_C)/s_(X̄E-X̄C), where
    s_(X̄E-X̄C) = √(2MS_e/n) = ____.

45. Since both the critical Dunnett t and s_(X̄E-X̄C) are
    constant for all comparisons, determining the minimum
    mean differences will expedite computation, i.e.,
    Min(X̄_E - X̄_C) = .975t(J, f_e) s_(X̄E-X̄C) = .975t(5,40) s_(X̄E-X̄C) =
    ( )(4.71) = 12.58

46. How does this compare with the critical mean differences
    a. for the Tukey method?
    b. for N-K?

ANSWERS

42. experiment

43. control

44. 4.71
    (An essential difference between the Dunnett and
    multiple "t" methods is that the critical t-value for the
    Dunnett considers the fact that there are J groups
    and J - 1 comparisons, not just two.)

45. 2.67

46. a. smaller
    b. larger than two, smaller than two

Confidence Intervals (We shall use α = .05)

Planned Contrasts:

    ψ̂ ± √(1-αF(1, f_e)) √(MS_e Σ(c_j^2/n_j)),  where ψ̂ = Σ c_j X̄_j

The nature of the confidence interval is more apparent when we
limit the C.I. to the difference between a pair of means,
hence:

    ψ̂ = (1)(X̄1) + (-1)(X̄2) or X̄1 - X̄2

Therefore, for planned orthogonal comparisons between
pairs of means, the .95 C.I. is given by
    (X̄1 - X̄2) ± √(.95F(1, f_e)) s_(X̄1-X̄2),  where s_(X̄1-X̄2) = √(2MS_e/n) for equal n

Scheffe:

    (X̄1 - X̄2) ± √((J - 1)(.95F(J-1, f_e))) s_(X̄1-X̄2)

Dunn:

    [formula not legible in source]

Tukey:

    (X̄1 - X̄2) ± (.95q(J, f_e)) s_X̄

47. Notice that the confidence interval for the S-method
    will be larger than that for the orthogonal contrasts
    to the extent that their respective critical F-values
    differ:

    √(4(.95F(4,40))) = √(4(2.61)) = 3.23 is greater than √(.95F(1,40)) = √4.08 or 2.02

48. In the present example 3.23 vs. 2.02 indicates the
    confidence interval using the S-method is
    3.23/2.02 = 1.6 times greater than the C.I. for a
    Planned Orthogonal Contrast (POC).

The precision of the estimates can be seen from the relative value for the .95
C.I. for the various multiple comparison approaches in estimating |μ3 - μ5|.
(c = number of hypotheses to be tested.)

    Orthogonal (c ≤ 4)    18.06
    Tukey                 25.54
    Scheffe               28.86
    Dunn (c = 10)         26.56
    Dunn (c = 5)          24.22
    Dunnett (c = 4)       23.06
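The Dunnett minimum difference in item 45 and the interval-width ratio in items 47 and 48 can both be checked from values given in the text. This is a sketch assuming equal n; the critical Dunnett t of 2.67 and the two tabled F-values are taken from the answers, the standard error is rounded to 4.71 as the text does, and the variable names are illustrative.

```python
import math

ms_error, n = 100, 9

# Standard error of the difference between two means with equal n.
se_diff = round(math.sqrt(2 * ms_error / n), 2)    # the text carries 4.71

dunnett_t = 2.67                                    # critical Dunnett t, item 45
dunnett_min_diff = dunnett_t * se_diff

# Half-widths of the Scheffe and POC intervals share se_diff, so their
# ratio depends only on the square roots of the critical F-values.
scheffe_mult = math.sqrt(4 * 2.61)                  # sqrt((J - 1) * .95F(4, 40))
poc_mult = math.sqrt(4.08)                          # sqrt(.95F(1, 40))
ratio = scheffe_mult / poc_mult

print(se_diff, round(dunnett_min_diff, 2))
print(round(scheffe_mult, 2), round(poc_mult, 2), round(ratio, 1))
```

The Dunnett criterion of 12.58 falls between the N-K criteria for r = 4 and r = 3, and the Scheffe interval is about 1.6 times as wide as the planned-contrast interval, which matches the ratio of the tabulated widths 28.86 and 18.06.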
Instructional Exercises

If an experiment error rate for the probability of a type I error, α, is
desired and hypotheses involve all and only pairs of means, one should
select:

    Answer: Tukey

Which of the methods can test all pairs of means and has a contrast error
rate?

    Answer: Newman-Keuls

If one were only interested in comparing each of X̄2, X̄3, X̄4, and X̄5 with X̄1,
he would probably select

    Answer: Dunnett

Which method is most general and places fewest restrictions on the hypotheses
that can be tested?

    Answer: Scheffe

In which method will the actual probability of a type I error, α, usually
be much larger than the tabled and reported α?

    Answer: Duncan

When there are three or more comparison groups, when will Scheffe necessarily
differ from planned orthogonal contrast?

a. in computing ψ̂

b. in calculating MS_ψ

c. in computing F

d. in the appropriate critical F-value

e. in the coefficients employed for a given contrast

    Answer: d

When comparing the extreme-most means:

a. Tukey will be less powerful than Newman-Keuls

b. Newman-Keuls will be less powerful than Tukey

c. Both will be equal in power

    Answer: c

In the above situation, will both Tukey and Newman-Keuls be more powerful than
Scheffe?

    Answer: yes

Will both Tukey and Newman-Keuls be more powerful than Dunn if all pairwise
contrasts are to be made?

    Answer: yes

When J * 3, and n 3 > ng, the probability of a type II error would be greatest
if one employed:

a. Dunnett's technique

b. Newman-Keuls technique

c. Scheffe's technique

d. Tukey's technique